MACHINE LEARNING PROJECT - CREDIT CARD APPROVAL PREDICTION¶

It is critical for any company to understand and minimize risk. A credit card company gathers a great deal of information about its applicants in order to do so. The "risk" in this case is the likelihood that an applicant will default on credit card borrowings. Using a dataset downloaded from Kaggle, I want to test my understanding of machine learning techniques and try to build a model capable of classifying any applicant as "good" or "bad" (i.e., low or high risk).

I have a lot of historical and personal data, but no good "target" variable to predict. The first part of the project therefore involves investigating the credit record and developing an algorithm to generate a "HIGH RISK" target feature: a categorical variable where a "high risk" applicant is anyone who has, at least once, been more than 60 days past due on a payment.

After figuring out how to divide the data into the two risk categories, the next step is to bring the applicant data and the target variable together in one dataframe. A proper statistical analysis will then be carried out to understand whether there is a common pattern among low- and high-risk credit card users. Before that, however, the dataframe must be divided into a training set and a testing set; the test set will prove useful later.

Once we know the data, the project can really start. First, we will use the knowledge gained from the EDA to decide how to clean and polish the dataframe: which features to drop and which to tweak (by fixing the skewness of their distributions, removing outliers, or normalizing them). A list of models will then be trained on the preprocessed data, and their performances compared to find the model best suited to our goal.

Finally, our best model will be used on our unseen test data, to find out how well it can perform.

Table of contents

  • 0 - Importing libraries, defining functions
  • 1 - Loading the data
  • 2 - Exploratory data analysis
    • 2.1 - Univariate analysis
    • 2.2 - Bivariate analysis
    • 2.3 - Chi-square test
    • 2.4 - Conclusion
  • 3 - Machine Learning
    • 3.1 - Data cleaning and preprocessing
    • 3.2 - Building and testing promising models
    • 3.3 - How to choose the right model
  • 4 - Final Result

0.1 - IMPORTING LIBRARIES ¶

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 200)
pd.options.display.float_format = '{:0,.3f}'.format
import warnings
# SettingWithCopyWarning lives in pandas.errors (the public API);
# warnings.simplefilter takes a single warning class per call, not a tuple
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
import operator
from pandas_profiling import ProfileReport
import numpy as np
import missingno
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="white", palette=None)
from scipy.stats import chi2_contingency
import scipy.stats as stats
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, roc_curve
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier
import scikitplot as skplt
from yellowbrick.model_selection import FeatureImportances
from imblearn.over_sampling import SMOTE
import joblib
import os
%matplotlib inline

0.2 - FUNCTIONS¶

All functions are grouped at the start in order to make the notebook more readable.

In [2]:
#Function to split the data into train and test sets
def data_split(df, test_size):
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

#Function to 'enhance' pandas .describe, by adding skew and kurtosis
def describe(df, stats):
    d = df.describe(include='all', percentiles=[0.25, 0.5, 0.75, 0.99])
    # DataFrame.append is deprecated; concatenate the aggregate rows instead
    return pd.concat([d, df.reindex(d.columns, axis=1).agg(stats)])
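
As a quick illustration of what the helper adds over plain `.describe()`, here is a minimal sketch on a toy dataframe (the column names and values are made up for the example):

```python
import pandas as pd

# Toy frame: one heavily right-skewed column, one symmetric one
df = pd.DataFrame({
    'income': [30, 32, 31, 29, 33, 30, 400],  # outlier inflates skew/kurtosis
    'age':    [25, 30, 35, 40, 45, 50, 55],
})

# Same idea as the notebook's describe(): stack skew/kurtosis under .describe()
d = df.describe()
extended = pd.concat([d, df.agg(['skew', 'kurt'])])

print(extended.loc[['mean', 'skew', 'kurt']])
```

The extra rows make it easy to scan all features at once for skewed, heavy-tailed distributions.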

#Helper that returns value count and frequency for a feature of a given dataframe
def _value_count(df, feature):
    ftr_concat = pd.concat([df[feature].value_counts(),
                            df[feature].value_counts(normalize=True) * 100], axis=1)
    ftr_concat.columns = ['Count', 'Frequency (%)']
    return ftr_concat

#Value count and frequency for each feature (all applicants)
def value_count(feature):
    return _value_count(eda_df, feature)

#Value count and frequency for each feature (high risk applicants only)
def value_count_high(feature):
    return _value_count(high_df, feature)

#Function to call the describe function for a specific feature
def gen_info(feature):
    match feature:
        case 'AGE'|'EMPLOYMENT LENGHT'|'ACCOUNT AGE'|'ANNUAL INCOME':
            print('*'*55)
            print('Description:\n{}'.format(eda_df[feature].describe()))
            print('*'*55)
        case _:
            print('*'*55)
            print('Description:\n{}'.format(eda_df[feature].describe()))
            print('*'*55)
            x = value_count(feature)
            print(f'Value count:\n{x}')
            print('*'*55)       

#Function to call the describe function for a specific feature, for high risk applicants
def high_info(feature):
    match feature:
        case 'AGE'|'EMPLOYMENT LENGHT'|'ACCOUNT AGE'|'ANNUAL INCOME':
            print('*'*55)
            print('Description:\n{}'.format(high_df[feature].describe()))
            print('*'*55)
        case _:
            print('*'*55)
            print('Description:\n{}'.format(high_df[feature].describe()))
            print('*'*55)
            x = value_count_high(feature)
            print(f'Value count:\n{x}')
            print('*'*55)

#Function to draw a bar plot
def draw_bar_plot(feature):
    match feature:
        case 'GENDER'|'HAS A CAR'|'OWNS REAL ESTATE'|'HAS A MOBILE PHONE'|'HAS A WORK PHONE'|'HAS A PHONE'|'HAS AN EMAIL'|'HIGH RISK':
            sns.set(rc={'figure.figsize':(5, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(y=value_count(feature).values[:,0],x=value_count(feature).index)
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()
        case 'OCCUPATION':
            sns.set(rc={'figure.figsize':(20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(x=value_count(feature).values[:,0],y=value_count(feature).index)
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()
        case _:    
            sns.set(rc={"figure.figsize":(20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(x=value_count(feature).index,y=value_count(feature).values[:,0])
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()

#Function to draw a box plot
def draw_box_plot(feature):
    match feature:    
        case 'ANNUAL INCOME':
            sns.set(rc={'figure.figsize':(5,10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature])
            plt.title(f'{feature} FEATURE', fontweight='bold')
            #remove scientific notation
            plt.ticklabel_format(style='plain', axis='y')
            return plt.show()
        case _:
            sns.set(rc={'figure.figsize':(5,10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature])
            plt.title(f'{feature} FEATURE', fontweight='bold')
            return plt.show()

#Function to draw a hist plot
def draw_hist_plot(feature):
    match feature:
        case 'ANNUAL INCOME':
            sns.set(rc={"figure.figsize":(20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.histplot(eda_df[feature], bins=49, kde=True)            
            plt.title(f'{feature} DISTRIBUTION', fontweight='bold')
            #remove scientific notation
            plt.ticklabel_format(style='plain', axis='x')
            return plt.show()
        case _:
            sns.set(rc={"figure.figsize":(20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.histplot(eda_df[feature], bins=49, kde=True)
            plt.title(f'{feature} DISTRIBUTION', fontweight='bold')
            return plt.show()

#Function to draw a box plot to compare High Risk vs Low Risk
def high_low_box_plot(feature):
    match feature:
        case 'ANNUAL INCOME':
            sns.set(rc={'figure.figsize':(8,10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature], x=eda_df['HIGH RISK'])
            #plt.ticklabel_format(style='plain', axis='y')
            plt.title(f'HIGH AND LOW RISK ON {feature}', fontweight='bold')
            return plt.show()
        case _:
            sns.set(rc={'figure.figsize':(8,10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature], x=eda_df['HIGH RISK'])
            plt.title(f'HIGH AND LOW RISK ON {feature}', fontweight='bold')
            return plt.show()

#Function to statistically test the influence of each feature on the target
def chi_square_test(results, feature):
    # 'results' is a dict collecting the chi-square statistic per feature
    # (renamed from 'dict' to avoid shadowing the builtin)
    HighRisk_feature = train_og_copy[train_og_copy['HIGH RISK']==1][feature]
    cross = pd.crosstab(index=HighRisk_feature, columns=['Count']).rename_axis(None, axis=1)
    cross.index.name = None
    # observed values
    obs = cross
    print('*'*20 + ' ' + feature + ' ' + '*'*20)
    print('Observed values:\n')
    print(obs)
    # expected values under the null hypothesis (uniform across categories)
    exp = pd.DataFrame([obs['Count'].sum()/len(obs)] * len(obs.index), columns=['Count'], index=obs.index)
    print('*'*55)
    print('Expected values:\n')
    print(exp)
    print('\n')
    # chi-square statistic
    chi_squared_stat = (((obs-exp)**2)/exp).sum()
    print('Chi-square:\n')
    print(chi_squared_stat.iloc[0])
    print('\n')
    results[feature] = float(chi_squared_stat.iloc[0])
    # critical value (99% confidence level)
    q = 0.99
    crit = stats.chi2.ppf(q=q, df=len(obs) - 1)
    print('Critical value:\n')
    print(crit)
    print('\n')
    # p-value
    p_value = 1 - stats.chi2.cdf(x=chi_squared_stat.iloc[0], df=len(obs) - 1)
    print('P-value:\n')
    print(p_value)
    print('\n')
    if p_value <= (1-q):
        print(f'WE REJECT THE NULL HYPOTHESIS: THE FEATURE "{feature}" HAS AN EFFECT ON TARGET')
        print('\n')
    else:
        print(f'WE FAIL TO REJECT THE NULL HYPOTHESIS: THE FEATURE "{feature}" HAS NO DETECTABLE EFFECT ON TARGET')
    return results
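
The function above computes the goodness-of-fit statistic by hand. As a sanity check, the same arithmetic can be reproduced with `scipy.stats.chisquare` on toy counts (the numbers below are illustrative, not from the dataset):

```python
import numpy as np
import scipy.stats as stats

# Toy observed counts for a 3-category feature among high-risk applicants
obs = np.array([30, 10, 20])
# Null hypothesis: categories are equally likely
exp = np.full(len(obs), obs.sum() / len(obs))

# Manual statistic, as in the notebook's chi_square_test
chi_sq = (((obs - exp) ** 2) / exp).sum()

# scipy's built-in goodness-of-fit test should agree
stat, p_value = stats.chisquare(obs, exp)

# Critical value at the 99% confidence level, df = k - 1
crit = stats.chi2.ppf(0.99, df=len(obs) - 1)
print(chi_sq, stat, p_value, crit)
```

Here the statistic (10.0) exceeds the critical value (~9.21), so the null of equal category frequencies would be rejected at the 1% level.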

#Function to remove outliers
class RemoveOutliers(BaseEstimator,TransformerMixin):
    def __init__(self,out_feature = ['FAMILY SIZE','ANNUAL INCOME', 'EMPLOYMENT LENGHT']):
        self.out_feature = out_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if (set(self.out_feature).issubset(df.columns)):
            # 25% quantile
            Q1 = df[self.out_feature].quantile(.25)
            # 75% quantile
            Q3 = df[self.out_feature].quantile(.75)
            IQR = Q3 - Q1
            # keep the data within 3 IQR
            df = df[~((df[self.out_feature] < (Q1 - 3 * IQR)) |(df[self.out_feature] > (Q3 + 3 * IQR))).any(axis=1)]
            return df
        else:
            print("One or more features are not in the dataframe")
            return df
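
The filtering rule inside `RemoveOutliers` can be seen in isolation on a toy frame (illustrative values; the 3×IQR fences are the same as in the class above):

```python
import pandas as pd

# Toy frame with one extreme value in 'ANNUAL INCOME'
df = pd.DataFrame({'ANNUAL INCOME': [100, 110, 105, 95, 120, 10_000],
                   'AGE': [30, 40, 35, 45, 50, 33]})

# Keep only rows whose values fall inside [Q1 - 3*IQR, Q3 + 3*IQR]
cols = ['ANNUAL INCOME']
Q1, Q3 = df[cols].quantile(.25), df[cols].quantile(.75)
IQR = Q3 - Q1
mask = ~((df[cols] < (Q1 - 3 * IQR)) | (df[cols] > (Q3 + 3 * IQR))).any(axis=1)
clean = df[mask]
print(len(df), '->', len(clean))
```

Note that 3×IQR is a deliberately loose fence (the conventional box-plot whisker uses 1.5×IQR), so only extreme points like the 10,000 above are dropped.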

#Function to drop useless or correlated features
class DropFeatures(BaseEstimator,TransformerMixin):
    def __init__(self,drop_feature = ['ID','# CHILDREN','HAS A MOBILE PHONE','OCCUPATION','ACCOUNT AGE']):
        self.drop_feature = drop_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if (set(self.drop_feature).issubset(df.columns)):
            # return a new frame rather than mutating the caller's dataframe in place
            return df.drop(self.drop_feature, axis=1)
        else:
            print("One or more features are not in the dataframe")
            return df

#One Hot Encoding
class OneHotEncoding(BaseEstimator,TransformerMixin):
    def __init__(self, one_hot_feature = ['GENDER','HAS A CAR','OWNS REAL ESTATE','INCOME TYPE','FAMILY STATUS','RESIDENCE TYPE','HAS A PHONE','HAS A WORK PHONE','HAS AN EMAIL']):
        self.one_hot_feature = one_hot_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if (set(self.one_hot_feature).issubset(df.columns)):
            #Function that actually encode the feature
            def one_hot_encoding(df,one_hot_feature):
                one_hot_encoding = OneHotEncoder()
                one_hot_encoding.fit(df[one_hot_feature])
                one_hot_feature_names = one_hot_encoding.get_feature_names_out(one_hot_feature)
                df = pd.DataFrame(one_hot_encoding.transform(df[self.one_hot_feature]).toarray(),columns=one_hot_feature_names,index=df.index)
                return df
            def concat_with_df(df,one_hot_encoding_df,one_hot_feature):
                rest_of_features = [ft for ft in df.columns if ft not in one_hot_feature]
                df_concat = pd.concat([one_hot_encoding_df, df[rest_of_features]],axis=1)
                return df_concat
            one_hot_encoding_df = one_hot_encoding(df,self.one_hot_feature)
            full_df = concat_with_df(df,one_hot_encoding_df,self.one_hot_feature)
            return full_df
        else:
            print("One or more features are not in the dataframe")
            return df

#Min-Max Scaler
class MinMax(BaseEstimator,TransformerMixin):
    def __init__(self,min_max_feature=['ANNUAL INCOME','AGE','EMPLOYMENT LENGHT','FAMILY SIZE']):
        self.min_max_feature=min_max_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if (set(self.min_max_feature).issubset(df.columns)):
            min_max_encoding = MinMaxScaler()
            df[self.min_max_feature] = min_max_encoding.fit_transform(df[self.min_max_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df           

#Fix Skewness
class FixSkewness(BaseEstimator,TransformerMixin):
    def __init__(self,skew_feature=['ANNUAL INCOME','AGE','EMPLOYMENT LENGHT','FAMILY SIZE']):
        self.skew_feature=skew_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if (set(self.skew_feature).issubset(df.columns)):
            #Since all of our features have positive skew, we are going to use a cube root transformation
            df[self.skew_feature] = np.cbrt(df[self.skew_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df
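
A quick sketch of why the cube-root transform helps: on an illustrative right-skewed sample (values made up for the example), the skewness drops after `np.cbrt`:

```python
import numpy as np
import pandas as pd

# Right-skewed toy sample (e.g. incomes with a long right tail)
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 50, 200])

before = s.skew()
after = np.cbrt(s).skew()
print(f'skew before: {before:.2f}, after cube root: {after:.2f}')
```

The cube root compresses large values much more than small ones, pulling in the right tail; unlike a log transform, it also handles zeros without special-casing.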

#Ordinal Encoding
class OrdinalEnc(BaseEstimator,TransformerMixin):
    def __init__(self,ordinal_feature=['EDUCATION']):
        self.ordinal_feature=ordinal_feature
    def fit(self,df):
        return self
    def transform(self,df):
        if 'EDUCATION' in df.columns:
            ord_enc = OrdinalEncoder()
            df[self.ordinal_feature]=ord_enc.fit_transform(df[self.ordinal_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df

#Function to change target dtype to 'numeric'
class ChangeToNumTarget(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self,df):
        return self
    def transform(self,df):
        if 'HIGH RISK' in df.columns:
            df['HIGH RISK'] = pd.to_numeric(df['HIGH RISK'])
            return df
        else:
            print("'HIGH RISK' is not in the dataframe")
            return df

#Oversampling
class Oversample(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self,df):
        return self
    def transform(self,df):
        if 'HIGH RISK' in df.columns:
            # smote function to oversample the minority class to fix the imbalance data
            oversample = SMOTE(sampling_strategy='minority')
            X_bal, y_bal = oversample.fit_resample(df.loc[:, df.columns != 'HIGH RISK'],df['HIGH RISK'])
            df_bal = pd.concat([pd.DataFrame(X_bal),pd.DataFrame(y_bal)],axis=1)
            return df_bal
        else:
            print("HIGH RISK is not in the dataframe")
            return df
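
SMOTE synthesizes new minority-class points by interpolating between existing minority neighbors. As a dependency-free sketch of the balancing goal only (plain random duplication, not SMOTE's interpolation), here is what "oversample the minority until classes match" means on a toy target:

```python
import pandas as pd

# Imbalanced toy target: 6 low-risk (0) vs 2 high-risk (1)
df = pd.DataFrame({'x': range(8), 'HIGH RISK': [0, 0, 0, 0, 0, 0, 1, 1]})

counts = df['HIGH RISK'].value_counts()
minority = counts.idxmin()
deficit = counts.max() - counts.min()

# Naive random oversampling: duplicate minority rows until classes match.
# (SMOTE instead creates *new* synthetic minority points between neighbors.)
extra = df[df['HIGH RISK'] == minority].sample(deficit, replace=True, random_state=42)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced['HIGH RISK'].value_counts().to_dict())
```

SMOTE is preferred over plain duplication because synthetic points reduce the risk of the classifier simply memorizing the repeated minority rows.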

#Function to call the pipeline dedicated to clean the data
def DataPreprocessing(df):
    pipeline=Pipeline([
        ('RemoveOutliers', RemoveOutliers()),
        ('DropFeature', DropFeatures()),
        ('FixSkewness', FixSkewness()),
        ('OneHot', OneHotEncoding()),
        ('Ordinal', OrdinalEnc()),
        ('MinMax', MinMax()),
        ('Numeric', ChangeToNumTarget()),
        ('Oversampling', Oversample()),
    ])
    df_pipe_prep=pipeline.fit_transform(df)
    return df_pipe_prep
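
The pipeline pattern above can be sketched with two toy transformers that follow the same `fit`/`transform` contract (simplified stand-ins for illustration, not the notebook's actual classes):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DropFeature(BaseEstimator, TransformerMixin):
    """Drop a single column, mirroring the notebook's DropFeatures step."""
    def __init__(self, col='ID'):
        self.col = col
    def fit(self, df, y=None):
        return self
    def transform(self, df):
        return df.drop(columns=[self.col])

class CbrtSkew(BaseEstimator, TransformerMixin):
    """Cube-root transform, mirroring FixSkewness."""
    def __init__(self, col='ANNUAL INCOME'):
        self.col = col
    def fit(self, df, y=None):
        return self
    def transform(self, df):
        df = df.copy()
        df[self.col] = np.cbrt(df[self.col])
        return df

df = pd.DataFrame({'ID': [1, 2], 'ANNUAL INCOME': [27_000.0, 64_000.0]})
pipe = Pipeline([('drop', DropFeature()), ('skew', CbrtSkew())])
out = pipe.fit_transform(df)
print(out)
```

Chaining the steps in a `Pipeline` guarantees they always run in the same order, so the exact same preprocessing can later be applied to the test set.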

#Function to get the feature importance of the classifier, and plot it
def feature_importance_plot(model, model_name):
    if model_name not in ['sgd','gaussian_naive_bayes','k_nearest_neighbors','bagging']:
        # top 10 most predictive features
        top_10_feat = FeatureImportances(model, relative=False, topn=10)
        # top 10 least predictive features
        bottom_10_feat = FeatureImportances(model, relative=False, topn=-10)
        #change the figure size
        plt.figure(figsize=(10, 4))
        #change x label font size
        plt.xlabel('xlabel', fontsize=14)
        # Fit to get the feature importances
        top_10_feat.fit(X_train, y_train)
        # show the plot
        top_10_feat.show()
        print('\n')
        plt.figure(figsize=(10, 4))
        plt.xlabel('xlabel', fontsize=14)
        # Fit to get the feature importances
        bottom_10_feat.fit(X_train, y_train)
        # show the plot
        bottom_10_feat.show()
        print('\n')
    else:
        print(f'No feature importance for {model_name}')
        print('\n')

#Function to get the y prediction
def y_prediction(model,model_name,final_model=False):
    if final_model == False:
        # check if y_train_copy_pred exists, if not create it
        y_train_pred_path = Path(f'saved_models/{model_name}/y_train_copy_pred_{model_name}.sav')
        try:
            y_train_pred_path.resolve(strict=True)
        except FileNotFoundError:
            #cross validation prediction with kfold = 10
            y_cc_train_pred = cross_val_predict(model,X_train,y_train,cv=10,n_jobs=-1)
            #save the predictions
            joblib.dump(y_cc_train_pred,y_train_pred_path)
            return y_cc_train_pred
        else:
            # if it exist load the predictions
            y_cc_train_pred = joblib.load(y_train_pred_path)
            return y_cc_train_pred
    else:
        # check if y_train_copy_pred exists, if not create it
        y_train_pred_path_final = Path(f'saved_models_final/{model_name}/y_train_copy_pred_{model_name}_final.sav')
        try:
            y_train_pred_path_final.resolve(strict=True)
        except FileNotFoundError:
            #cross validation prediction with kfold = 10
            y_cc_train_pred_final = cross_val_predict(model,X_train,y_train,cv=10,n_jobs=-1)
            #save the predictions
            joblib.dump(y_cc_train_pred_final,y_train_pred_path_final)
            return y_cc_train_pred_final
        else:
            # if it exist load the predictions
            y_cc_train_pred_final = joblib.load(y_train_pred_path_final)
            return y_cc_train_pred_final
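
The compute-or-load pattern used by `y_prediction` can be sketched with a hypothetical cache path and `pickle` in place of `joblib` (the `cached_prediction` helper and `expensive` stand-in are made up for illustration):

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; the notebook uses saved_models/<name>/... instead
cache = Path(tempfile.mkdtemp()) / 'pred.pkl'

def cached_prediction(compute):
    """Load predictions from disk if present, otherwise compute and save them."""
    if cache.exists():
        with cache.open('rb') as f:
            return pickle.load(f)
    result = compute()
    with cache.open('wb') as f:
        pickle.dump(result, f)
    return result

calls = []
def expensive():          # stand-in for the slow cross_val_predict call
    calls.append(1)
    return [0, 1, 1]

first = cached_prediction(expensive)
second = cached_prediction(expensive)   # served from disk, no recompute
print(first, second, len(calls))
```

This makes re-running the notebook cheap: the 10-fold cross-validated predictions are only ever computed once per model.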

#Function to plot the confusion matrix
def confusion_matrix(model,model_name,final_model=False):
    if final_model == False:
        fig, ax = plt.subplots(figsize=(8,8))
        #plot confusion matrix
        conf_matrix = ConfusionMatrixDisplay.from_predictions(y_train,y_prediction(model,model_name),ax=ax, cmap='GnBu',values_format='d')
        # remove the grid
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.grid(visible=None, axis='both')
        # increase the font size of the x and y labels
        plt.xlabel('Predicted label', fontsize=14)
        plt.ylabel('True label', fontsize=14)
        #give a title to the plot using the model name
        plt.title('Confusion Matrix', fontsize=14)
        #show the plot
        plt.show()
        print('\n')
    else:
        fig, ax = plt.subplots(figsize=(8,8))
        #plot confusion matrix
        conf_matrix_final = ConfusionMatrixDisplay.from_predictions(y_train,y_prediction(model,model_name,final_model=True),ax=ax, cmap='GnBu',values_format='d')
        # remove the grid
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.grid(visible=None)
        # increase the font size of the x and y labels
        plt.xlabel('Predicted label', fontsize=14)
        plt.ylabel('True label', fontsize=14)
        #give a title to the plot using the model name
        plt.title('Confusion Matrix', fontsize=14)
        #show the plot
        plt.show()
        print('\n')

#Function to plot the roc curve
#(note: this shadows the sklearn.metrics.roc_curve imported above, which is not called directly)
def roc_curve(model,model_name,final_model=False):
    if final_model == False:
        # check if y probabilities file exists, if not create it
        y_proba_path = Path(f'saved_models/{model_name}/y_cc_train_proba_{model_name}.sav')
        try:
            y_proba_path.resolve(strict=True)
        except FileNotFoundError:
            y_train_proba = model.predict_proba(X_train)
            joblib.dump(y_train_proba,y_proba_path)
        else:
            # if path exist load the y probabilities file
            y_train_proba = joblib.load(y_proba_path)
        skplt.metrics.plot_roc(y_train, y_train_proba, title = f'ROC curve for {model_name}', cmap='cool',figsize=(8,6), text_fontsize='large')
        #remove the grid
        plt.grid(visible=None)
        plt.show()
        print('\n')
    else:
        # check if y probabilities file exists, if not create it
        y_proba_path_final = Path(f'saved_models_final/{model_name}/y_cc_train_proba_{model_name}_final.sav')
        try:
            y_proba_path_final.resolve(strict=True)
        except FileNotFoundError:
            y_train_proba_final = model.predict_proba(X_train)
            joblib.dump(y_train_proba_final,y_proba_path_final)
        else:
            # if path exist load the y probabilities file
            y_train_proba_final = joblib.load(y_proba_path_final)
        skplt.metrics.plot_roc(y_train, y_train_proba_final, title = f'ROC curve for {model_name}', cmap='cool',figsize=(8,6), text_fontsize='large')
        #remove the grid
        plt.grid(visible=None)
        plt.show()
        print('\n')

#Function to display the classification report
def score(model, model_name, final_model=False):
    if final_model == False:
        class_report = classification_report(y_train,y_prediction(model,model_name))
        print(class_report)
    else:
        class_report_final = classification_report(y_train,y_prediction(model,model_name,final_model=True))
        print(class_report_final)
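
For reference, here is `classification_report` on toy labels (illustrative values only), with `output_dict=True` to access individual scores programmatically:

```python
from sklearn.metrics import classification_report

# Toy ground truth vs cross-validated predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Printable text report, plus a dict form for programmatic access
print(classification_report(y_true, y_pred))
rep = classification_report(y_true, y_pred, output_dict=True)
```

For an imbalanced target like ours, the per-class recall rows matter far more than overall accuracy: a model can score high accuracy while missing nearly every high-risk applicant.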

#Function to train the model
def train_model(model,model_name,final_model=False):
    # if we are not training the final model
    if final_model == False:
        # check if the model file exist and if not create, train and save it
        model_file_path = Path(f'saved_models/{model_name}/{model_name}_model.sav')
        try:
            model_file_path.resolve(strict=True)
        except FileNotFoundError:
            if model_name == 'sgd':
                # for sgd, loss='hinge' has no predict_proba method, so we wrap it in a calibrated model
                calibrated_model = CalibratedClassifierCV(model, cv=10, method='sigmoid')
                model_trn = calibrated_model.fit(X_train,y_train)
            else:
                model_trn = model.fit(X_train,y_train)
            # save the fitted model (model_trn), not the unfitted estimator
            joblib.dump(model_trn,model_file_path)
            # plot the most and least predictive features
            return model_trn
        else:
            # if path exist load the model
            model = joblib.load(model_file_path)
            # plot the most and least predictive features
            return model
    else:
        # check if the final model file exist and if not create, train and save it
        final_model_file_path = Path(f'saved_models_final/{model_name}/{model_name}_model.sav')
        try:
            final_model_file_path.resolve(strict=True)
        except FileNotFoundError:
            model = model.fit(X_train,y_train)
            joblib.dump(model,final_model_file_path)
            # plot the most and least predictive features
            return model
        else:
            # if path exist load the model
            model = joblib.load(final_model_file_path)
            # plot the most and least predictive features
            return model

#Function to create the folder for saving the model if it does not exist
#(relies on the global 'model_name' being set before each training run)
def folder_check():
    os.makedirs(f'saved_models/{model_name}', exist_ok=True)

1. IMPORT THE DATA ¶

In [4]:
#Read Csv Files
app_df = pd.read_csv('application_record.csv')
record_df = pd.read_csv('credit_record.csv')
In [5]:
#Flag as risky every account that was at least once more than 60 days past due
#(STATUS '2' to '5' means 60+ days overdue); .loc avoids chained-assignment warnings
record_df['High Risk'] = np.nan
record_df.loc[record_df['STATUS'].isin(['2', '3', '4', '5']), 'High Risk'] = 'Yes'
In [6]:
#Group data by account ID
record_df = record_df.groupby('ID').count()
#Number of months with past due payments
record_df['PAST DUE MONTHS'] = record_df['High Risk']
record_df = record_df.rename(columns={'MONTHS_BALANCE':'ACCOUNT AGE'})
In [7]:
#Flag each account that was at least once past due
record_df['X'] = None
record_df.loc[record_df['High Risk'] == 0, 'X'] = 0
record_df.loc[record_df['High Risk'] > 0, 'X'] = 1
record_df.drop(['High Risk','STATUS'], axis=1, inplace=True)
record_df = record_df.rename(columns={'X':'HIGH RISK'})
In [8]:
#We are only interested in the account age and the categorical target 'HIGH RISK'
record_df = record_df[['ACCOUNT AGE', 'HIGH RISK']]
In [9]:
#Merging data with applicants df, from now on we are working with this
app_df = pd.merge(app_df, record_df, how='inner', on='ID')
In [10]:
#Better columns name
app_df = app_df.rename(columns={'CODE_GENDER':'GENDER', 'FLAG_OWN_CAR':'HAS A CAR', 'CNT_CHILDREN':'# CHILDREN'})
app_df = app_df.rename(columns={'AMT_INCOME_TOTAL':'ANNUAL INCOME', 'NAME_INCOME_TYPE':'INCOME TYPE', 'NAME_EDUCATION_TYPE':'EDUCATION'})
app_df = app_df.rename(columns={'NAME_FAMILY_STATUS':'FAMILY STATUS', 'NAME_HOUSING_TYPE':'RESIDENCE TYPE', 'DAYS_BIRTH':'AGE'})
app_df = app_df.rename(columns={'DAYS_EMPLOYED':'EMPLOYMENT LENGHT', 'FLAG_MOBIL':'HAS A MOBILE PHONE', 'FLAG_WORK_PHONE':'HAS A WORK PHONE'})
app_df = app_df.rename(columns={'FLAG_PHONE':'HAS A PHONE', 'FLAG_EMAIL':'HAS AN EMAIL', 'OCCUPATION_TYPE':'OCCUPATION'})
app_df = app_df.rename(columns={'CNT_FAM_MEMBERS':'FAMILY SIZE', 'FLAG_OWN_REALTY':'OWNS REAL ESTATE'})
In [11]:
#Fix applicant's age, employment age and categorical features
app_df['AGE'] = np.trunc(np.abs(app_df['AGE']/365)).astype(np.int64)
#Pensioner fix: the DAYS_EMPLOYED placeholder truncates to 1000 'years'; assume a maximum of 50 working years instead
app_df['EMPLOYMENT LENGHT'] = np.trunc(np.abs(app_df['EMPLOYMENT LENGHT']/365)).astype(np.int64)
app_df = app_df.replace({'EMPLOYMENT LENGHT':{1000:50}})
app_df = app_df.replace({'HAS A MOBILE PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS A WORK PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS A PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS AN EMAIL':{0:'N',1:'Y'}})
app_df['FAMILY SIZE'] = app_df['FAMILY SIZE'].astype(np.int64)
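
The day-count conversion above can be checked on a few illustrative values (in this dataset family, DAYS_BIRTH/DAYS_EMPLOYED are negative day counts, and pensioners carry a large positive placeholder that truncates to 1000 years):

```python
import numpy as np
import pandas as pd

# Two real-looking negative day counts and the pensioner placeholder (365243 days)
s = pd.Series([-12005, -4000, 365243])

# Same steps as the notebook: years = trunc(|days| / 365), then cap pensioners at 50
years = np.trunc(np.abs(s / 365)).astype(np.int64)
years = years.replace({1000: 50})
print(years.tolist())
```

Capping at 50 keeps pensioners inside a plausible range instead of letting the placeholder dominate the EMPLOYMENT LENGHT distribution.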
In [12]:
train_og, test_og = data_split(app_df, 0.3)
#saving train and test sets
train_og.to_csv('dataset/train.csv', index=False)
test_og.to_csv('dataset/test.csv', index=False)
#creating a backup
train_og_copy = train_og.copy()
test_og_copy = test_og.copy()

2. EXPLORATORY DATA ANALYSIS ¶

In [13]:
#We are going to take a look only at the train data
eda_df = train_og
eda_df = eda_df.replace({'HIGH RISK':{0:'N', 1:'Y'}})
#'HIGH RISK' was just mapped to 'Y'/'N', so filter on 'Y' (not 'Yes')
high_df = eda_df[eda_df['HIGH RISK'] == 'Y']
In [14]:
#Bird's eye view on our dataframe
describe(eda_df, ['skew', 'kurt'])
Out[14]:
ID GENDER HAS A CAR OWNS REAL ESTATE # CHILDREN ANNUAL INCOME INCOME TYPE EDUCATION FAMILY STATUS RESIDENCE TYPE AGE EMPLOYMENT LENGHT HAS A MOBILE PHONE HAS A WORK PHONE HAS A PHONE HAS AN EMAIL OCCUPATION FAMILY SIZE ACCOUNT AGE HIGH RISK
count 25,519.000 25519 25519 25519 25,519.000 25,519.000 25519 25519 25519 25519 25,519.000 25,519.000 25519 25519 25519 25519 17589 25,519.000 25,519.000 25519
unique NaN 2 2 2 NaN NaN 5 5 5 6 NaN NaN 1 2 2 2 18 NaN NaN 2
top NaN F N Y NaN NaN Working Secondary / secondary special Married House / apartment NaN NaN Y N N N Laborers NaN NaN N
freq NaN 17117 15876 17159 NaN NaN 13145 17266 17575 22820 NaN NaN 25519 19801 17980 23207 4402 NaN NaN 25082
mean 5,078,278.001 NaN NaN NaN 0.433 187,022.766 NaN NaN NaN NaN 43.290 14.103 NaN NaN NaN NaN NaN 2.201 21.275 NaN
std 41,788.362 NaN NaN NaN 0.747 101,869.023 NaN NaN NaN NaN 11.512 17.257 NaN NaN NaN NaN NaN 0.915 14.870 NaN
min 5,008,805.000 NaN NaN NaN 0.000 27,000.000 NaN NaN NaN NaN 21.000 0.000 NaN NaN NaN NaN NaN 1.000 1.000 NaN
25% 5,042,112.500 NaN NaN NaN 0.000 121,500.000 NaN NaN NaN NaN 34.000 3.000 NaN NaN NaN NaN NaN 2.000 9.000 NaN
50% 5,074,692.000 NaN NaN NaN 0.000 157,500.000 NaN NaN NaN NaN 42.000 6.000 NaN NaN NaN NaN NaN 2.000 18.000 NaN
75% 5,114,615.500 NaN NaN NaN 1.000 225,000.000 NaN NaN NaN NaN 53.000 15.000 NaN NaN NaN NaN NaN 3.000 31.000 NaN
99% 5,149,808.820 NaN NaN NaN 3.000 560,250.000 NaN NaN NaN NaN 66.000 50.000 NaN NaN NaN NaN NaN 5.000 59.000 NaN
max 5,150,482.000 NaN NaN NaN 19.000 1,575,000.000 NaN NaN NaN NaN 68.000 50.000 NaN NaN NaN NaN NaN 20.000 61.000 NaN
skew 0.082 NaN NaN NaN 2.703 2.748 NaN NaN NaN NaN 0.183 1.388 NaN NaN NaN NaN NaN 1.373 0.731 NaN
kurt -1.207 NaN NaN NaN 26.214 17.870 NaN NaN NaN NaN -1.044 0.307 NaN NaN NaN NaN NaN 9.649 -0.382 NaN

The "OCCUPATION" column is the only one with a lower count than the rest of the dataset, which implies that this feature contains null values, so we need to check that.

We can already obtain some basic information about our applicants, but first we need to study the distribution of each numerical variable.

The skewness is positive for every numerical feature, meaning each distribution has a tail on the right. 'AGE' and 'ACCOUNT AGE' have skewness closest to zero, which indicates they are the closest to normality.

Kurtosis is positive, with high values for the number of children, income, and family size; those features are probably affected by outliers.


In [15]:
#We shouldn't have any NaNs except for the 'OCCUPATION' column. Let's check it.
missingno.matrix(eda_df, color=(1, 0.38, 0.27));

This confirms our suspicion: 'OCCUPATION' is the only feature with NaNs, which probably correspond to people without a job. Since it is not a numerical feature, imputing it could be problematic.


In [16]:
eda_df[eda_df['OCCUPATION'].isnull()].head(10)
Out[16]:
ID GENDER HAS A CAR OWNS REAL ESTATE # CHILDREN ANNUAL INCOME INCOME TYPE EDUCATION FAMILY STATUS RESIDENCE TYPE AGE EMPLOYMENT LENGHT HAS A MOBILE PHONE HAS A WORK PHONE HAS A PHONE HAS AN EMAIL OCCUPATION FAMILY SIZE ACCOUNT AGE HIGH RISK
10 5068319 F N Y 0 189,000.000 Pensioner Higher education Separated House / apartment 62 50 Y N N N NaN 1 3 N
14 5067680 F N Y 0 90,000.000 Pensioner Secondary / secondary special Married House / apartment 59 50 Y N N N NaN 2 3 N
15 5111102 M Y Y 0 351,000.000 Pensioner Secondary / secondary special Married House / apartment 61 50 Y N N N NaN 2 60 N
16 5054237 F N N 0 135,000.000 Pensioner Higher education Single / not married House / apartment 57 50 Y N Y N NaN 1 15 N
20 5022267 M Y Y 0 131,400.000 Pensioner Secondary / secondary special Married House / apartment 64 50 Y N N N NaN 2 9 N
24 5096839 F N Y 0 94,500.000 Pensioner Higher education Single / not married House / apartment 61 50 Y N N N NaN 1 43 N
26 5037243 F N Y 0 121,500.000 Pensioner Secondary / secondary special Married House / apartment 65 50 Y N N N NaN 2 14 N
29 5145887 F N Y 0 72,000.000 Pensioner Secondary / secondary special Married House / apartment 60 50 Y N N N NaN 2 4 N
33 5021720 F Y Y 1 630,000.000 Working Secondary / secondary special Married House / apartment 48 1 Y N N N NaN 3 13 N
35 5105412 F N Y 2 135,000.000 Working Secondary / secondary special Married House / apartment 34 4 Y N N N NaN 4 26 N
In [17]:
x = eda_df['INCOME TYPE'].unique()
print(f'Income types on our df are {x}')
Income types on our df are ['Working' 'Commercial associate' 'State servant' 'Pensioner' 'Student']

There are no jobless applicants (which makes sense given that the minimum income value isn't 0).
The IDs without an occupation belong to people of various ages, income levels, and education levels.
It could be a simple data-entry error, and since we already have a lot of information about each applicant, we could simply delete this column.
We'll make that decision later in the analysis.
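Either way, the two options look like this on toy data ('Unknown' is a placeholder label of my choosing, not a category from the dataset):

```python
import pandas as pd

df = pd.DataFrame({'OCCUPATION': ['Laborers', None, 'Managers', None]})

# Option 1: drop the feature entirely (what the analysis ends up doing)
dropped = df.drop(columns=['OCCUPATION'])

# Option 2: keep it, treating "missing" as its own category
filled = df['OCCUPATION'].fillna('Unknown')
print(filled.value_counts())
```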


In [18]:
#EDA in html
profile_report = ProfileReport(eda_df, explorative=True, dark_mode=False)
profile_report_file_path = Path('pandas_profile_file/credit_pred_profile.html')
if not profile_report_file_path.exists():
    profile_report.to_file(profile_report_file_path)

Using the handy pandas-profiling Python library, it is possible to create an interactive report saved as an HTML page. It is possible to consult it here.


2.1 - UNIVARIATE ANALYSIS ¶

In [19]:
feature = 'GENDER'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           F
freq      17117
Name: GENDER, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
F  17117         67.076
M   8402         32.924
*******************************************************
In [20]:
feature = 'HAS A CAR'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      15876
Name: HAS A CAR, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  15876         62.212
Y   9643         37.788
*******************************************************
In [21]:
feature = 'OWNS REAL ESTATE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           Y
freq      17159
Name: OWNS REAL ESTATE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
Y  17159         67.240
N   8360         32.760
*******************************************************
In [22]:
feature = '# CHILDREN'
gen_info(feature)
draw_box_plot(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count   25,519.000
mean         0.433
std          0.747
min          0.000
25%          0.000
50%          0.000
75%          1.000
max         19.000
Name: # CHILDREN, dtype: float64
*******************************************************
Value count:
    Count  Frequency (%)
0   17610         69.007
1    5249         20.569
2    2311          9.056
3     282          1.105
4      47          0.184
5      15          0.059
7       2          0.008
14      2          0.008
19      1          0.004
*******************************************************
In [23]:
feature='ANNUAL INCOME'
gen_info(feature)
draw_box_plot(feature)
draw_hist_plot(feature)
high_low_box_plot(feature)
*******************************************************
Description:
count      25,519.000
mean      187,022.766
std       101,869.023
min        27,000.000
25%       121,500.000
50%       157,500.000
75%       225,000.000
max     1,575,000.000
Name: ANNUAL INCOME, dtype: float64
*******************************************************
In [24]:
feature='EDUCATION'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count                             25519
unique                                5
top       Secondary / secondary special
freq                              17266
Name: EDUCATION, dtype: object
*******************************************************
Value count:
                               Count  Frequency (%)
Secondary / secondary special  17266         67.659
Higher education                6972         27.321
Incomplete higher                995          3.899
Lower secondary                  264          1.035
Academic degree                   22          0.086
*******************************************************
In [25]:
feature = 'FAMILY STATUS'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count       25519
unique          5
top       Married
freq        17575
Name: FAMILY STATUS, dtype: object
*******************************************************
Value count:
                      Count  Frequency (%)
Married               17575         68.870
Single / not married   3362         13.174
Civil marriage         2024          7.931
Separated              1487          5.827
Widow                  1071          4.197
*******************************************************
In [26]:
feature = 'RESIDENCE TYPE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count                 25519
unique                    6
top       House / apartment
freq                  22820
Name: RESIDENCE TYPE, dtype: object
*******************************************************
Value count:
                     Count  Frequency (%)
House / apartment    22820         89.424
With parents          1222          4.789
Municipal apartment    772          3.025
Rented apartment       407          1.595
Office apartment       184          0.721
Co-op apartment        114          0.447
*******************************************************
In [27]:
feature = 'AGE'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        43.290
std         11.512
min         21.000
25%         34.000
50%         42.000
75%         53.000
max         68.000
Name: AGE, dtype: float64
*******************************************************
In [28]:
feature = 'EMPLOYMENT LENGHT'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        14.103
std         17.257
min          0.000
25%          3.000
50%          6.000
75%         15.000
max         50.000
Name: EMPLOYMENT LENGHT, dtype: float64
*******************************************************
In [29]:
feature = 'HAS A MOBILE PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        1
top           Y
freq      25519
Name: HAS A MOBILE PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
Y  25519        100.000
*******************************************************
In [30]:
feature = 'HAS A WORK PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      19801
Name: HAS A WORK PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  19801         77.593
Y   5718         22.407
*******************************************************
In [31]:
feature = 'HAS A PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      17980
Name: HAS A PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  17980         70.457
Y   7539         29.543
*******************************************************
In [32]:
feature = 'HAS AN EMAIL'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      23207
Name: HAS AN EMAIL, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  23207         90.940
Y   2312          9.060
*******************************************************
In [33]:
feature = 'OCCUPATION'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count        17589
unique          18
top       Laborers
freq          4402
Name: OCCUPATION, dtype: object
*******************************************************
Value count:
                       Count  Frequency (%)
Laborers                4402         25.027
Core staff              2484         14.122
Sales staff             2407         13.685
Managers                2120         12.053
Drivers                 1499          8.522
High skill tech staff    991          5.634
Accountants              862          4.901
Medicine staff           831          4.725
Cooking staff            463          2.632
Security staff           408          2.320
Cleaning staff           374          2.126
Private service staff    249          1.416
Low-skill Laborers       120          0.682
Waiters/barmen staff     116          0.660
Secretaries              108          0.614
HR staff                  59          0.335
Realty agents             51          0.290
IT staff                  45          0.256
*******************************************************
In [34]:
feature = 'FAMILY SIZE'
gen_info(feature)
draw_box_plot(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count   25,519.000
mean         2.201
std          0.915
min          1.000
25%          2.000
50%          2.000
75%          3.000
max         20.000
Name: FAMILY SIZE, dtype: float64
*******************************************************
Value count:
    Count  Frequency (%)
2   13623         53.384
1    4875         19.103
3    4486         17.579
4    2203          8.633
5     270          1.058
6      43          0.169
7      14          0.055
9       2          0.008
15      2          0.008
20      1          0.004
*******************************************************
In [35]:
feature='ACCOUNT AGE'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        21.275
std         14.870
min          1.000
25%          9.000
50%         18.000
75%         31.000
max         61.000
Name: ACCOUNT AGE, dtype: float64
*******************************************************
In [36]:
feature='HIGH RISK'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      25082
Name: HIGH RISK, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  25082         98.288
Y    437          1.712
*******************************************************

2.2 - Bivariate Analysis ¶

In [37]:
#Encode the binary Y/N columns (and the target) as 1/0
binary_cols = ['HAS A MOBILE PHONE', 'HAS A WORK PHONE', 'HAS A PHONE', 'HAS AN EMAIL', 'HIGH RISK']
eda_df = eda_df.replace({col: {'N': 0, 'Y': 1} for col in binary_cols})
In [38]:
sns.pairplot(eda_df.drop(['ID', 'HAS A MOBILE PHONE', 'HAS A WORK PHONE', 'HAS A PHONE', 'HAS AN EMAIL'], axis=1), hue='HIGH RISK', corner=True);

'# CHILDREN' and 'FAMILY SIZE' are strongly correlated, which makes sense: the more children you have, the bigger your family is. The same goes for 'AGE' and 'EMPLOYMENT LENGHT': the older you are, the longer your career has been. Having pairs of features that are correlated with each other could be a problem later on.
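The pairwise Pearson correlation behind these observations can be checked directly; a toy sketch (in the notebook this would run on eda_df's actual columns):

```python
import pandas as pd

# Perfectly linear toy relationship: family size = children + 2 (the parents)
df = pd.DataFrame({'# CHILDREN': [0, 1, 2, 3], 'FAMILY SIZE': [2, 3, 4, 5]})
r = df['# CHILDREN'].corr(df['FAMILY SIZE'])
print(r)  # 1.0 for this perfectly linear toy data
```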


In [39]:
#Account Age and Applicant age
sns.jointplot(x = eda_df['ACCOUNT AGE'], y = eda_df['AGE'], kind="hex", height=12)
plt.show()

Most of the users are between 25 and 50 years old and have an account that is no older than 25 months.


In [40]:
#Correlation
plt.figure(figsize=(25,10))
plt.title('Correlation Matrix',fontsize=25)
corr = eda_df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, cmap='flare',mask=mask, linewidths=.5)
plt.show()

Number of children and family size, and age and employment length, are correlated with each other, as we already knew. No feature is correlated with 'HIGH RISK'.


In [41]:
#How age affects the other variables
fig, axes = plt.subplots(4,2,figsize=(30,25),dpi=250)
fig.tight_layout(pad=9.0)
sns.boxplot(ax=axes[0,0], x=eda_df['GENDER'], y=eda_df['AGE']);
sns.boxplot(ax=axes[0,1], x=eda_df['OWNS REAL ESTATE'], y=eda_df['AGE']);
sns.boxplot(ax=axes[1,0], x=eda_df['HAS A CAR'], y=eda_df['AGE']);
sns.boxplot(ax=axes[1,1], x=eda_df['RESIDENCE TYPE'], y=eda_df['AGE']);
sns.boxplot(ax=axes[2,0], x=eda_df['AGE'], y=eda_df['FAMILY STATUS']);
sns.boxplot(ax=axes[2,1], x=eda_df['AGE'], y=eda_df['INCOME TYPE']);
sns.boxplot(ax=axes[3,0], x=eda_df['AGE'], y=eda_df['EDUCATION']);
sns.boxplot(ax=axes[3,1], x=eda_df['AGE'], y=eda_df['OCCUPATION']);
In [42]:
#How income affects the other variables
fig, axes = plt.subplots(4,2,figsize=(30,25),dpi=250)
fig.tight_layout(pad=9.0)
sns.boxplot(ax=axes[0,0], x=eda_df['GENDER'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[0,1], x=eda_df['OWNS REAL ESTATE'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[1,0], x=eda_df['HAS A CAR'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[1,1], x=eda_df['RESIDENCE TYPE'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[2,0], x=eda_df['ANNUAL INCOME'], y=eda_df['FAMILY STATUS']);
sns.boxplot(ax=axes[2,1], x=eda_df['ANNUAL INCOME'], y=eda_df['INCOME TYPE']);
sns.boxplot(ax=axes[3,0], x=eda_df['ANNUAL INCOME'], y=eda_df['EDUCATION']);
sns.boxplot(ax=axes[3,1], x=eda_df['ANNUAL INCOME'], y=eda_df['OCCUPATION']);

2.3 - CHI-SQUARE TEST ¶

The correlation analysis we did previously told us that our target variable is not strongly correlated with any feature in the dataset, but we still need to know whether any of them has some effect on 'HIGH RISK' in order to build a sound model later on. In other words, we want to test whether the occurrence of a specific feature value and the occurrence of a specific class are independent.

The Chi-square test serves this purpose, as it is used in statistics to test the independence of two events. It measures how much the observed counts 'O' deviate from the expected counts 'E'. When two variables are independent, the observed counts are close to the expected counts, giving a small Chi-square value; a high Chi-square value indicates that the hypothesis of independence is unlikely to hold. In other words, the higher the Chi-square value, the more dependent the feature is on the target, and the stronger the case for selecting it for model training.

In the hypothesis test, the null hypothesis is: 'The feature has no effect on the target variable'. If the p-value is higher than alpha (at the 99% confidence level), we fail to reject the null hypothesis; otherwise we reject it.
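The statistic can be reproduced by hand; a minimal pure-Python sketch of the goodness-of-fit computation on the observed GENDER counts among high-risk applicants shown in the output below (271 F vs 166 M, with a uniform expectation of 218.5 each):

```python
# Chi-square goodness-of-fit against a uniform expectation,
# mirroring what chi_square_test reports for GENDER.
observed = {'F': 271, 'M': 166}
expected = sum(observed.values()) / len(observed)   # 218.5 per category
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
print(round(chi2, 3))  # 25.229, matching the Chi-square value below
```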

In [43]:
feature_test = ['GENDER','HAS A CAR','OWNS REAL ESTATE','INCOME TYPE','EDUCATION','FAMILY STATUS','RESIDENCE TYPE','OCCUPATION']
chi_sq_dict = {}
for ft in feature_test:
    chi_square_test(chi_sq_dict, ft)
******************** GENDER ********************
Observed values:

   Count
F    271
M    166
*******************************************************
Expected values:

    Count
F 218.500
M 218.500


Chi-square:

25.22883295194508


Critical value:

6.6348966010212145


P-value:

[5.09152992e-07]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "GENDER" HAS EFFECT ON TARGET


******************** HAS A CAR ********************
Observed values:

   Count
N    276
Y    161
*******************************************************
Expected values:

    Count
N 218.500
Y 218.500


Chi-square:

30.263157894736842


Critical value:

6.6348966010212145


P-value:

[3.77223487e-08]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "HAS A CAR" HAS EFFECT ON TARGET


******************** OWNS REAL ESTATE ********************
Observed values:

   Count
N    182
Y    255
*******************************************************
Expected values:

    Count
N 218.500
Y 218.500


Chi-square:

12.194508009153319


Critical value:

6.6348966010212145


P-value:

[0.0004793]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "OWNS REAL ESTATE" HAS EFFECT ON TARGET


******************** INCOME TYPE ********************
Observed values:

                      Count
Commercial associate     97
Pensioner                90
State servant            23
Working                 227
*******************************************************
Expected values:

                       Count
Commercial associate 109.250
Pensioner            109.250
State servant        109.250
Working              109.250


Chi-square:

199.76887871853546


Critical value:

11.344866730144373


P-value:

[0.]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "INCOME TYPE" HAS EFFECT ON TARGET


******************** EDUCATION ********************
Observed values:

                               Count
Higher education                 111
Incomplete higher                 22
Lower secondary                    8
Secondary / secondary special    296
*******************************************************
Expected values:

                                Count
Higher education              109.250
Incomplete higher             109.250
Lower secondary               109.250
Secondary / secondary special 109.250


Chi-square:

482.7711670480549


Critical value:

11.344866730144373


P-value:

[0.]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "EDUCATION" HAS EFFECT ON TARGET


******************** FAMILY STATUS ********************
Observed values:

                      Count
Civil marriage           33
Married                 280
Separated                18
Single / not married     77
Widow                    29
*******************************************************
Expected values:

                      Count
Civil marriage       87.400
Married              87.400
Separated            87.400
Single / not married 87.400
Widow                87.400


Chi-square:

553.6521739130434


Critical value:

13.276704135987622


P-value:

[0.]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "FAMILY STATUS" HAS EFFECT ON TARGET


******************** RESIDENCE TYPE ********************
Observed values:

                     Count
Co-op apartment          2
House / apartment      385
Municipal apartment     20
Office apartment         4
Rented apartment         6
With parents            20
*******************************************************
Expected values:

                     Count
Co-op apartment     72.833
House / apartment   72.833
Municipal apartment 72.833
Office apartment    72.833
Rented apartment    72.833
With parents        72.833


Chi-square:

1609.8787185354695


Critical value:

15.08627246938899


P-value:

[0.]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "RESIDENCE TYPE" HAS EFFECT ON TARGET


******************** OCCUPATION ********************
Observed values:

                       Count
Accountants               16
Cleaning staff             4
Cooking staff              6
Core staff                47
Drivers                   35
HR staff                   1
High skill tech staff     21
IT staff                   3
Laborers                  74
Low-skill Laborers         5
Managers                  32
Medicine staff             9
Private service staff      1
Sales staff               36
Secretaries                2
Security staff            10
Waiters/barmen staff       1
*******************************************************
Expected values:

                       Count
Accountants           17.824
Cleaning staff        17.824
Cooking staff         17.824
Core staff            17.824
Drivers               17.824
HR staff              17.824
High skill tech staff 17.824
IT staff              17.824
Laborers              17.824
Low-skill Laborers    17.824
Managers              17.824
Medicine staff        17.824
Private service staff 17.824
Sales staff           17.824
Secretaries           17.824
Security staff        17.824
Waiters/barmen staff  17.824


Chi-square:

381.5445544554455


Critical value:

31.999926908815176


P-value:

[0.]


WE REJECT THE NULL HYPOTHESIS: THE FEATURE "OCCUPATION" HAS EFFECT ON TARGET


In [44]:
#Sorted Chi-square value
sortdict=sorted(chi_sq_dict.items(),key=operator.itemgetter(1),reverse=True)
print(sortdict)
[('RESIDENCE TYPE', 1609.8787185354695), ('FAMILY STATUS', 553.6521739130434), ('EDUCATION', 482.7711670480549), ('OCCUPATION', 381.5445544554455), ('INCOME TYPE', 199.76887871853546), ('HAS A CAR', 30.263157894736842), ('GENDER', 25.22883295194508), ('OWNS REAL ESTATE', 12.194508009153319)]

2.4 CONCLUSION ¶

The average applicant is a woman (67%) who doesn't have a car (62%) but owns real estate (67%). She lives in a house/apartment (89%), is married (68%), and has at most one child (90%).

She earns almost 187,000 per year. She has a secondary degree (68%), started working 14 years ago, and is 43 years old. Furthermore, she has a mobile phone (100%) but doesn't have a phone (70%), a work phone (77%), or an email (90%). She is a laborer (25%) and has been a client for about 21 months. She is not considered a high-risk applicant (only 2% are).

98% of all applicants are low-risk, and that could be a problem later on. Since the number of high-risk applicants is so low, it is difficult to characterize the average high-risk user. There does seem to be a pattern related to age: younger people tend to have less job experience, which leads to a lower income, so younger people are more likely to struggle with debt.

So, to get a sense of who is more likely to be a high-risk user, let's see how the other variables vary with age:
Men tend to be younger than women. Younger people are less likely to own real estate but more likely to have a car. They are also more likely to live with their parents, be single, and have an incomplete education.

The same process has been applied to income:
It emerges that men have, on average, a higher income. Wealthier applicants own real estate and live in houses/apartments, and most of them are managers or realty agents. People with lower incomes are students, pensioners, and younger people with an incomplete or lower education, i.e., people with fewer working skills and less job experience.

3. MACHINE LEARNING ¶

3.1 - DATA CLEANING ¶

Features that need to be dropped:

  • ID: it serves no purpose
  • #CHILDREN: to prevent multicollinearity
  • HAS A MOBILE PHONE: it serves no purpose
  • OCCUPATION: it has a lot of null values; there is no way to replace them, and dropping all the affected rows could be too costly for the model
  • ACCOUNT AGE: it could make the model overfit the data

Features that need one-hot encoding:

  • GENDER
  • HAS A CAR
  • OWNS REAL ESTATE
  • FAMILY STATUS
  • RESIDENCE TYPE
  • HAS A PHONE
  • HAS A WORK PHONE
  • HAS AN EMAIL

Feature that needs ordinal encoding:

  • EDUCATION

Features that need normalization:

  • ANNUAL INCOME
  • AGE
  • EMPLOYMENT LENGHT
  • FAMILY SIZE

Features with skewed data, that need to be reduced:

  • ANNUAL INCOME
  • EMPLOYMENT LENGHT
  • FAMILY SIZE

Features with outliers:

  • ANNUAL INCOME
  • EMPLOYMENT LENGHT
  • FAMILY SIZE
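The DataPreprocessing function used below is defined earlier in the notebook; as a hedged sketch, the steps listed above could be assembled with scikit-learn roughly like this (the EDUCATION ordering and the use of log1p/MinMaxScaler are my assumptions, not necessarily the notebook's exact choices):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                   OneHotEncoder, OrdinalEncoder)

onehot_cols = ['GENDER', 'HAS A CAR', 'OWNS REAL ESTATE', 'FAMILY STATUS',
               'RESIDENCE TYPE', 'HAS A PHONE', 'HAS A WORK PHONE', 'HAS AN EMAIL']
# Assumed ordering, lowest to highest attainment
education_order = [['Lower secondary', 'Secondary / secondary special',
                    'Incomplete higher', 'Higher education', 'Academic degree']]
numeric_cols = ['ANNUAL INCOME', 'AGE', 'EMPLOYMENT LENGHT', 'FAMILY SIZE']

numeric_pipe = Pipeline([
    ('unskew', FunctionTransformer(np.log1p)),  # tame the right-skewed distributions
    ('scale', MinMaxScaler()),                  # normalize to [0, 1]
])

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='if_binary'), onehot_cols),
    ('ordinal', OrdinalEncoder(categories=education_order), ['EDUCATION']),
    ('num', numeric_pipe, numeric_cols),
], remainder='drop')  # drops ID, # CHILDREN, HAS A MOBILE PHONE, OCCUPATION, ACCOUNT AGE
```

The transformer would be fit on the train set only and then reused on the test set, to avoid leaking test-set statistics into the scaling.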
In [45]:
train=DataPreprocessing(train_og_copy)
In [46]:
#We separate train data X from the target y
y_train = train['HIGH RISK']
X_train = train.drop(['HIGH RISK'], axis=1)

3.2 - Building and testing promising model ¶

In [47]:
#List of promising models
models = {
    'sgd':SGDClassifier(random_state=42,loss='perceptron'),
    'logistic_regression':LogisticRegression(random_state=42,max_iter=1000),
    'decision_tree':DecisionTreeClassifier(random_state=42),
    'random_forest':RandomForestClassifier(random_state=42),
    'gaussian_naive_bayes':GaussianNB(),
    'k_nearest_neighbors':KNeighborsClassifier(),
    'gradient_boosting':GradientBoostingClassifier(random_state=42),
    'linear_discriminant_analysis':LinearDiscriminantAnalysis(),
    'bagging':BaggingClassifier(random_state=42),
    'adaboost':AdaBoostClassifier(random_state=42),
    'extra_trees':ExtraTreesClassifier(random_state=42),
    'xgboost':XGBClassifier(random_state=42)
    }
In [48]:
# loop over all the models
for model_name,model in models.items():
    # title formatting
    print('\n')
    print('\n')
    print('  {}  '.center(50,'-').format(model_name))
    print('\n')
    # check if the folder for saving the model exists, if not create it
    folder_check()
    # train the model
    model_trn = train_model(model,model_name)
    # print the scores from the classification report
    score(model_trn, model_name)
    # plot the ROC curve
    roc_curve(model_trn,model_name)
    # plot the confusion matrix
    confusion_matrix(model_trn,model_name)
    # plot feature importance
    feature_importance_plot(model_trn, model_name)
    warnings.filterwarnings("ignore")



----------------------  sgd  ----------------------


              precision    recall  f1-score   support

           0       0.56      0.57      0.57     24746
           1       0.56      0.56      0.56     24746

    accuracy                           0.56     49492
   macro avg       0.56      0.56      0.56     49492
weighted avg       0.56      0.56      0.56     49492



No feature importance for sgd






----------------------  logistic_regression  ----------------------


              precision    recall  f1-score   support

           0       0.57      0.56      0.56     24746
           1       0.57      0.57      0.57     24746

    accuracy                           0.57     49492
   macro avg       0.57      0.57      0.57     49492
weighted avg       0.57      0.57      0.57     49492









----------------------  decision_tree  ----------------------


              precision    recall  f1-score   support

           0       0.98      0.98      0.98     24746
           1       0.98      0.98      0.98     24746

    accuracy                           0.98     49492
   macro avg       0.98      0.98      0.98     49492
weighted avg       0.98      0.98      0.98     49492









----------------------  random_forest  ----------------------


              precision    recall  f1-score   support

           0       0.99      0.99      0.99     24746
           1       0.99      0.99      0.99     24746

    accuracy                           0.99     49492
   macro avg       0.99      0.99      0.99     49492
weighted avg       0.99      0.99      0.99     49492









----------------------  gaussian_naive_bayes  ----------------------


              precision    recall  f1-score   support

           0       0.71      0.07      0.13     24746
           1       0.51      0.97      0.67     24746

    accuracy                           0.52     49492
   macro avg       0.61      0.52      0.40     49492
weighted avg       0.61      0.52      0.40     49492



No feature importance for gaussian_naive_bayes






----------------------  k_nearest_neighbors  ----------------------


              precision    recall  f1-score   support

           0       0.98      0.95      0.97     24746
           1       0.95      0.98      0.97     24746

    accuracy                           0.97     49492
   macro avg       0.97      0.97      0.97     49492
weighted avg       0.97      0.97      0.97     49492



No feature importance for k_nearest_neighbors






----------------------  gradient_boosting  ----------------------


              precision    recall  f1-score   support

           0       0.86      0.94      0.90     24746
           1       0.93      0.85      0.89     24746

    accuracy                           0.89     49492
   macro avg       0.90      0.89      0.89     49492
weighted avg       0.90      0.89      0.89     49492









----------------------  linear_discriminant_analysis  ----------------------


              precision    recall  f1-score   support

           0       0.57      0.56      0.56     24746
           1       0.57      0.57      0.57     24746

    accuracy                           0.57     49492
   macro avg       0.57      0.57      0.57     49492
weighted avg       0.57      0.57      0.57     49492









----------------------  bagging  ----------------------


              precision    recall  f1-score   support

           0       0.99      0.99      0.99     24746
           1       0.99      0.99      0.99     24746

    accuracy                           0.99     49492
   macro avg       0.99      0.99      0.99     49492
weighted avg       0.99      0.99      0.99     49492



No feature importance for bagging






----------------------  adaboost  ----------------------


              precision    recall  f1-score   support

           0       0.75      0.78      0.77     24746
           1       0.77      0.74      0.76     24746

    accuracy                           0.76     49492
   macro avg       0.76      0.76      0.76     49492
weighted avg       0.76      0.76      0.76     49492









----------------------  extra_trees  ----------------------


              precision    recall  f1-score   support

           0       0.99      0.99      0.99     24746
           1       0.99      0.99      0.99     24746

    accuracy                           0.99     49492
   macro avg       0.99      0.99      0.99     49492
weighted avg       0.99      0.99      0.99     49492









----------------------  xgboost  ----------------------


              precision    recall  f1-score   support

           0       0.99      0.99      0.99     24746
           1       0.99      0.99      0.99     24746

    accuracy                           0.99     49492
   macro avg       0.99      0.99      0.99     49492
weighted avg       0.99      0.99      0.99     49492





3.3 - How to choose the right model ¶

Now a decision needs to be made. We have a handful of promising models, but we need to find the one that best fits our needs. For a credit card company, a good model should minimize the risk of marking as low-risk an applicant who is actually high-risk; in other words, the model should produce the fewest false negatives on the high-risk class. XGBoost matches the description, being among the models with the best scores on that class.

In some situations, however, companies could make different decisions. For example, imagine a scenario in which the economy is doing great: salaries are growing and so is spending, so money is flowing. In such a situation, a credit card company could be more interested in maximizing the number of legitimate users than in avoiding a couple of false negatives, since a larger user base makes it easier for the company to absorb a handful of risky accounts. In that case, another model with higher recall, rather than the best precision, could do a better job.

Precision and recall tend to trade off against each other, so when choosing between a couple of good models we must figure out which metric matters more for the specific problem. For this project, I imagined the second scenario, so I chose the gradient boosting model.
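To make the trade-off concrete, both metrics can be derived from the confusion-matrix counts. A minimal pure-Python sketch with toy labels (1 = high risk, 0 = low risk; these are illustrative values, not the project's actual predictions):

```python
# Toy labels, for illustration only (1 = high risk, 0 = low risk)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]

# Count the three cell types of the confusion matrix we need
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # high risk, caught
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # high risk, missed
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # low risk, rejected

recall = tp / (tp + fn)     # few false negatives -> high recall
precision = tp / (tp + fp)  # few false positives -> high precision
print(f'FN={fn}  recall={recall:.2f}   FP={fp}  precision={precision:.2f}')
# -> FN=1  recall=0.75   FP=2  precision=0.60
```

The first scenario (avoid missing a risky applicant) pushes recall up; the second (avoid turning away good applicants) pushes precision up.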

In [49]:
# Apply the same preprocessing pipeline used on the training data to the held-out test set
test = DataPreprocessing(test_og_copy)
In [50]:
X_test = test.drop(['HIGH RISK'], axis=1)
y_test = test['HIGH RISK']
In [51]:
model = train_model(models['gradient_boosting'],'gradient_boosting')
In [52]:
predictions = model.predict(X_test)
In [53]:
n_correct = sum(predictions == y_test)
In [54]:
print(f'The model was able to correctly classify the data {(round(n_correct/len(predictions),4))*100}% of the time')
The model was able to correctly classify the data 86.2% of the time
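As an aside, Python's `:.2%` format specifier expresses a ratio as a percentage directly, avoiding the manual round-then-multiply step in the cell above. A self-contained sketch with toy labels (not the project's real predictions):

```python
# Toy predictions and ground truth, for illustration only
predictions = [0, 1, 1, 0, 1]
y_test = [0, 1, 0, 0, 1]

# Count matching positions, exactly as in the cell above
n_correct = sum(p == t for p, t in zip(predictions, y_test))
accuracy = n_correct / len(predictions)

# ':.2%' multiplies by 100 and appends '%' in one step
print(f'The model was able to correctly classify the data {accuracy:.2%} of the time')
# -> The model was able to correctly classify the data 80.00% of the time
```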

4. Final Result ¶

We started by looking at the data to see how it was put together. This phase was crucial not only because it allowed us to define the typical client profile, but also because it helped us identify the "flaws" in the data set, which simplified the data cleaning that followed, the first phase of any machine learning project. Mishandled missing data, multi-correlated features, or a failure to consider how the data is distributed could compromise even the best model, rendering it useless.

After understanding, cleaning, and preprocessing the data, we used a portion of the data (the train set) to train several models, and then decided which one was best based on their performance. As previously stated, there are numerous factors to consider when making this decision. The gradient boosting model was chosen here, in line with the economic scenario imagined in section 3.3, where a few extra risky accounts are an acceptable price for a larger user base.

It was eventually time to test the chosen model on the portion of the data set that we had set aside from the start (the test data) precisely to avoid any bias that could invalidate the entire project.

Is it possible to use machine learning models to rank customers by their likelihood of defaulting on their debts?
Our model classifies correctly 86 percent of the time, so the answer is likely yes. This result could be improved further by optimizing the hyperparameters of our models, or by experimenting with deep learning and neural networks.
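The hyperparameter optimization mentioned above could be sketched with scikit-learn's `GridSearchCV`. The grid values, the synthetic data, and the choice of `precision` as the scoring metric are illustrative assumptions (matching the scenario chosen in section 3.3), not the project's actual search:

```python
# Hypothetical hyperparameter search for the gradient boosting model.
# Synthetic data and a tiny grid, for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    'n_estimators': [50, 100],  # number of boosting stages
    'max_depth': [2, 3],        # depth of each individual tree
    'learning_rate': [0.1],     # shrinkage applied to each tree's contribution
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring='precision',  # optimize the metric favoured by the chosen scenario
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each grid point is evaluated with 3-fold cross-validation on the training data only, so the held-out test set stays untouched until the final evaluation.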